[blog] Introducing inf2 runtime blog post
#540
Conversation
Looks Good.
[AWS Inferentia2](https://aws.amazon.com/en/ec2/instance-types/inf2/) (Inf2 for short) is the second-generation inference accelerator from AWS. Inf2 instances improve on Inf1 (originally launched in 2019) by delivering 3x higher compute performance, 4x larger total accelerator memory, up to 4x higher throughput, and up to 10x lower latency. Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed connectivity between accelerators.
Relative to the [AWS G5 instances](https://aws.amazon.com/ec2/instance-types/g5/) ([NVIDIA A10G](https://www.nvidia.com/en-us/data-center/products/a10-gpu/)), Inf2 instances promise up to 50% better performance-per-watt. Inf2 instances are ideal for applications such as natural language processing, recommender systems, image classification and recognition, speech recognition, and language translation that can take advantage of scale-out distributed inference.
should we quote the inf2 numbers here against say an A100 for reference?
Will do this once we have the profiling numbers to compare, I could only find this stat relative to G5.
## 📦 Deploying a model on Inferentia2 with NOS
Deploying models on AWS Inferentia2 chips presents a unique set of challenges, distinctly different from the experience with NVIDIA GPUs. This is primarily due to the lack of a mature toolchain for compiling, profiling, and deploying models onto these specialized ASICs. To effectively utilize the AWS Inferentia2 chips, custom model tracing and compilation are essential steps. This process demands a deep understanding of the deployment toolchain, including PyTorch IR op-support and the [AWS Neuron SDK](https://github.com/aws-neuron/aws-neuron-sdk), to fully optimize model performance. NOS aims to bridge this gap and streamline the deployment process, making it easier for developers to leverage AWS Inferentia2 for their inference workloads and expose easy-to-use gRPC/RESTful services.
[nit] distinctly different -> distinct. Also wouldn't say 'lack of mature toolchain'; maybe just point out that the Neuron SDK is very different from the Torch CUDA ecosystem.
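For readers unfamiliar with the tracing/compilation step described above, here is a minimal sketch (not part of the post) of compiling a Hugging Face embedding model for Inferentia2 with `torch-neuronx`. The sequence length, padding strategy, and output path are illustrative assumptions, and exact APIs may vary across Neuron SDK versions.

```python
# Minimal sketch: compile BAAI/bge-small-en-v1.5 for Inferentia2 with torch-neuronx.
# This is the manual workflow that NOS aims to abstract away; shapes and file paths
# below are illustrative assumptions, not values from the post.
import torch
import torch_neuronx
from transformers import AutoModel, AutoTokenizer

model_id = "BAAI/bge-small-en-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torchscript=True).eval()

# Neuron compilation is shape-specialized, so trace with a fixed, padded input shape.
inputs = tokenizer(
    "hello world",
    padding="max_length",
    max_length=128,
    return_tensors="pt",
)
example = (inputs["input_ids"], inputs["attention_mask"])

# Trace and compile the model into a Neuron-optimized TorchScript artifact.
traced = torch_neuronx.trace(model, example)
traced.save("bge-small-en-v1.5.neuron.pt")
```

The saved artifact can then be loaded with `torch.jit.load` on an Inf2 instance and served behind the gRPC/RESTful endpoints that NOS exposes.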
| Model | Cloud Instance | Spot | Cost / hr | Cost / month | # of Req. / $ |
| ----- | -------------- | ---- | --------- | ------------ | ------------- |
| [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) | `inf2.xlarge` | - | $0.75 | ~$540 | ~685K / $1 |
| **[BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5)** | `inf2.xlarge` | ✅ | **$0.32** | **~$230** | ~1.6M / $1 |
💯
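As a sanity check on the requests-per-dollar column, a quick back-of-the-envelope calculation; the throughput used below is derived from the on-demand row of the table, not an independently measured number.

```python
# Back-of-the-envelope check of the table's requests-per-dollar figures.
# The throughput is implied by the on-demand row (~685K requests per $1 at $0.75/hr);
# it is a derived assumption, not a benchmark result.
on_demand_cost_per_hr = 0.75
spot_cost_per_hr = 0.32
on_demand_requests_per_dollar = 685_000

# Implied throughput in requests/sec (same hardware, so it carries over to spot).
throughput_rps = on_demand_requests_per_dollar * on_demand_cost_per_hr / 3600
print(f"implied throughput: ~{throughput_rps:.0f} req/s")        # ~143 req/s

# Requests per dollar at the spot price.
spot_requests_per_dollar = throughput_rps * 3600 / spot_cost_per_hr
print(f"spot: ~{spot_requests_per_dollar / 1e6:.1f}M req / $1")  # ~1.6M, matching the table
```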
Summary
Related issues
Checks
- `make lint`: I've run `make lint` to lint the changes in this PR.
- `make test`: I've made sure the tests (`make test-cpu` or `make test`) are passing.